Abstract

The study of evolution in the animal world is immensely
diverse. Animals can be categorized using data mining tools
such as Weka, a freely available tool that provides a single
platform combining classification, clustering, association,
validation and visualization. Classification is the
arrangement of objects, ideas, or information into groups
whose members have one or more characteristics in common.
Classification makes things easier to find, identify, and
study. Taking this diversity into account, species are
classified using their attributes in Weka. The animal kingdom
is categorized into vertebrates and invertebrates. In this
paper an animal kingdom data set is developed by collecting
data from the A to Z vertebrates animal kingdom repository.
The data set consists of 51 instances with 6 attributes: name,
weight, size, lifespan, origin, and group. The data set is
split into training and test sets using the remove percentage
filter. The partitioned data sets are evaluated individually
using Weka algorithms, and the results are compared using
error rate and accuracy rate. The results are verified using
the Knowledge Flow environment.

Keywords 

Machine Learning; Data Mining; WEKA; Classification; Knowledge 
Flow Experimenter; Animal Kingdom Data Set 

Introduction 

There is a staggering increase in the population and
evolution of living things in the environment.
Populations are groups of individuals belonging to the
same region. Populations, like individual organisms,
have unique attributes such as growth rate, age
structure, sex ratio, and mortality rate [2]. The first and
largest category in population evolution is the kingdom.
There are five kingdoms in our environment. Over
1 million different species of animals have been
identified and classified, and perhaps millions more
have not yet been classified. The animal kingdom is mainly
categorized into two forms, vertebrates and invertebrates.
Vertebrates are animals of a higher order than
invertebrates. Vertebrates are divided into five
groups: mammals, birds, amphibians, reptiles and fish.
We classify living things according to the characteristics
they share [1]. To study different types of animals, it is
convenient to classify them by common characteristics.
The main focus of this paper is to classify animals
based on their attributes [18]. Weka is one of the
frameworks for classification that contains many
well-known data mining algorithms. Classification in Weka is
made by considering attributes such as origin, life
span, weight, size, and color. Although each of these
groups of animals has unique characteristics, they have
some common characteristics as well [2].

Weka is a machine learning tool which complements
data mining. An understanding of algorithms is
combined with detailed knowledge of the data sets. Data
sets in Weka are used as validation, training and test sets.
Data can be supplied to Weka in three forms: 1. direct data
set, 2. pre-categorized data set, 3. raw data set. In this
paper pre-categorized data sets are provided to Weka to
analyze the performance of algorithms. The performance of
classification is analyzed using classified instances, error
rate, and kappa statistics.

It is widely known that classifiers differ in
performance. A given classifier may work better on the
training set or on the test set, and this is not known in
advance. The performance on each data set is therefore
tested using different algorithms.

Data Mining Tool: WEKA 

Data mining is the process of extracting information
from large data sets through different techniques [3].
Data mining is popularly called knowledge discovery
in large data, obtained by analyzing and accessing
statistical information from databases. In this paper we
have used WEKA, a data mining tool, for classification
techniques. Weka provides the required data mining
functions and methodologies. The data formats used with
WEKA are MS Excel (saved as CSV) and ARFF. Weka, a machine
learning workbench, implements algorithms for data
preprocessing, classification, regression, clustering and
association rules [4]. Implementations in Weka are
classified as:

1. Implemented schemes for classification;

2. Implemented schemes for numeric prediction;

3. Implemented meta-schemes.

Learning methods in Weka are called classifiers; they
contain tunable parameters that can be accessed through
a property sheet or object editor. The exploration modes
in Weka provide data preprocessing, learning, data
processing, attribute selection and data visualization
modules in an environment that encourages initial
exploration of data. Data are preprocessed using the
Remove useless filter, which removes attributes that vary
too much or too little across the data set [8]. The Remove
percentage filter is used to produce the training and test
sets.



The Remove percentage filter is used to split
the overall data set into a training set and a test set. In
our data set, name is the most widely varying attribute, so
the Remove useless filter is used to remove the name
attribute from the data set.
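
The effect of the Remove useless filter can be sketched in Python as follows. This is a minimal illustration, not Weka's implementation: the function name `remove_useless`, the `max_variance_pct` threshold of 99 and the four sample records are assumptions introduced here and are not taken from the animal kingdom data set.

```python
def remove_useless(rows, nominal, max_variance_pct=99.0):
    """Drop constant attributes and nominal attributes that vary too much.

    rows    : list of dicts, one dict per instance
    nominal : set of attribute names treated as nominal (assumed input)
    """
    n = len(rows)
    kept = []
    for col in rows[0].keys():
        distinct = len({row[col] for row in rows})
        if distinct <= 1:
            continue  # the attribute never varies
        if col in nominal and 100.0 * distinct / n > max_variance_pct:
            continue  # nearly every instance has its own value
        kept.append(col)
    return [{c: row[c] for c in kept} for row in rows]


# Hypothetical records, for illustration only.
animals = [
    {"name": "Lion", "weight": 190.0, "kingdom": "animal", "group": "Mammal"},
    {"name": "Eagle", "weight": 6.0, "kingdom": "animal", "group": "Aves"},
    {"name": "Frog", "weight": 0.02, "kingdom": "animal", "group": "Amphibian"},
    {"name": "Tiger", "weight": 220.0, "kingdom": "animal", "group": "Mammal"},
]
filtered = remove_useless(animals, nominal={"name", "kingdom", "group"})
```

With these hypothetical records, `name` is dropped because every value is distinct and `kingdom` is dropped because it never varies, while `weight` and `group` are kept.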

Classification Methods 

NAIVE BAYES: 

In the Naive Bayes classifier, attributes are assumed to be
conditionally independent given the class [10]. This greatly
reduces the computation cost, since only the class
distribution needs to be counted.

Suppose there are m classes $C_1, C_2, \ldots, C_m$ and a
tuple $X = (x_1, x_2, \ldots, x_n)$. The classification is
derived using the maximum a posteriori rule, i.e., the class
with the maximal $P(C_i \mid X)$. This can be derived from
Bayes' theorem [16]:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$

Since $P(X)$ is constant for all classes, only
$P(X \mid C_i)\,P(C_i)$ needs to be maximized. The
goal of this classification is to correctly predict the value
of a designated discrete class variable given a vector of
attributes, using 10-fold cross validation [24]. The Naive
Bayes classifier is applied to the training and test sets,
and the performance is evaluated individually with kappa
statistics and error rate.
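
The maximum a posteriori rule can be illustrated with a small Python sketch for nominal attributes. It is an illustration only and not the Weka implementation: the functions `train` and `predict`, the use of Laplace smoothing and the four toy tuples are all assumptions introduced here.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count classes, attribute values per class, and distinct values per attribute."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    distinct = [set() for _ in rows[0]]
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
            distinct[j].add(value)
    return class_counts, value_counts, [len(s) for s in distinct]


def predict(row, model):
    """Return the class C_i that maximizes P(C_i) * prod_j P(x_j | C_i)."""
    class_counts, value_counts, n_values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(C_i)
        for j, value in enumerate(row):
            # Laplace smoothing (an assumption here) avoids zero probabilities.
            score *= (value_counts[(c, j)][value] + 1) / (n_c + n_values[j])
        scores[c] = score
    return max(scores, key=scores.get)


# Toy tuples (weight class, life span class), for illustration only.
rows = [("heavy", "long"), ("heavy", "long"), ("light", "short"), ("light", "long")]
labels = ["Mammal", "Mammal", "Aves", "Aves"]
model = train(rows, labels)
```

Because $P(X)$ is the same for every class, the sketch never computes it and compares only the products of prior and conditional probabilities.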



Data Set 

Records of the database were created in an Excel data
sheet and saved in CSV (Comma Separated Value) format,
which was then converted to the ARFF format accepted by
WEKA using the WEKA command line.
Predominant vertebrate animal data sets are taken for
classification. The records of the database consist of 6
attributes, from which 5 attributes were selected using
the Remove useless filter, which filters out the unwanted
attributes in the data set [23]. Only 60% of the overall
data is used as a training set and the remaining 40% is used
as a test set [7].
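
The 60/40 partition can be sketched in Python as below. The function `remove_percentage` is a hypothetical stand-in written for this illustration; in particular, which end of the data set is removed and how the cut-off is rounded are assumptions, not a statement about Weka's filter.

```python
def remove_percentage(instances, percentage, invert=False):
    """Remove the first `percentage` percent of instances.

    With invert=True the selection is inverted, so only the removed
    part is kept. The rounding rule is an assumption of this sketch.
    """
    cutoff = round(len(instances) * percentage / 100)
    if invert:
        return instances[:cutoff]
    return instances[cutoff:]


instances = list(range(51))  # stand-ins for the 51 records
training = remove_percentage(instances, 40)
test = remove_percentage(instances, 40, invert=True)
```

Under these assumptions the 51 instances yield 31 training instances and 20 test instances, and the two sets do not overlap.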

Training Set And Test Set 

The full data set is loaded and the training set is
produced using the Remove percentage filter in the
preprocess panel. The full data set is then loaded again to
produce the test set.

The test set is prepared by setting the invert selection
property to true and applying the percentage filter with the
corresponding percentage. The Remove useless filter removes
attributes that vary too much or too little across the
entire data set. The considered attributes are name, average
weight, average size, life span, and origin [6].

Fig. 1 Simulation result for training set: Naive Bayes

Fig. 2 Simulation result for testing set: Naive Bayes

SVM: 

A Support Vector Machine (SVM) classifier separates a set of
objects into their respective groups with a line [14].
Hyperplane classifiers separate objects of different
classes by drawing separating lines among the objects.
SVM performs classification tasks by constructing
hyperplanes in a multidimensional space [11]. SVM supports
both regression and classification tasks and can handle
multiple continuous and categorical variables. Training
an SVM always finds a unique global minimum [13].
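
How a separating hyperplane assigns a class can be sketched in Python as follows. The sketch shows only the decision step for a linear hyperplane $w \cdot x + b = 0$; it does not train an SVM, and the weight vector, bias and sample points are hypothetical values chosen for illustration.

```python
def decision_value(w, b, x):
    """Signed score of x relative to the hyperplane w . x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical hyperplane in two dimensions: x1 - x2 = 0.
w, b = (1.0, -1.0), 0.0
side_a = classify(w, b, (3.0, 1.0))
side_b = classify(w, b, (1.0, 3.0))
```

Points on opposite sides of the line receive opposite labels; training consists of choosing `w` and `b`, which is outside the scope of this sketch.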

Fig. 4 Simulation result for testing set: SVM



IBK: 



K-NN is a supervised learning algorithm in which a new
instance is classified from its K nearest neighbours in the
training set, where K is a user-specified number [9]. With
K = 1, it predicts the same class as the nearest
instance in the training set. The training phase of the
classifier stores the features and the class labels of the
training set. New objects are classified based on a
voting criterion [13], which provides the maximum likelihood
estimate of the class. The Euclidean distance metric is
used, and objects are assigned to the most frequent class
label among the neighbours. Distances are calculated from
all training objects to the test object using an appropriate
K value [15]. In this paper K is set to 1, which means that
the chosen class label is the same as that of the closest
training object.
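
The K = 1 case can be sketched in Python as follows. This is a minimal illustration and not the IBK implementation: the function name and the three training points (average weight, average size) are hypothetical.

```python
import math


def nearest_neighbour(train, query):
    """Return the label of the training object closest to `query`.

    train : list of (features, label) pairs
    Distance is Euclidean, as in the description of K-NN with K = 1.
    """
    best_label, best_dist = None, math.inf
    for features, label in train:
        dist = math.dist(features, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical training objects, for illustration only.
train = [
    ((190.0, 2.0), "Mammal"),
    ((6.0, 0.9), "Aves"),
    ((0.02, 0.08), "Amphibian"),
]
label_small = nearest_neighbour(train, (5.0, 1.0))
label_large = nearest_neighbour(train, (150.0, 1.8))
```

In practice attributes on very different scales would usually be normalized first, otherwise the attribute with the largest range dominates the distance.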

Fig. 3 Simulation result for training set: SVM

Fig. 5 Simulation result for training set: IBK

Fig. 6 Simulation result for testing set: IBK

J48:

The J48 classifier divides training objects with a missing
value into fractional parts proportional to the
frequencies of the observed non-missing values [21].
Cross validation is used to split the data set into
training and testing parts. J48 builds decision trees from a
set of training data. At each node of the tree, the
classifier chooses the attribute of the data that most
effectively splits its set of samples into subsets enriched
in one class or the other. The attribute with the highest
normalized information gain is chosen to make the
decision. The algorithm then recurses on the smaller
sublists of the data set.
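
The attribute selection step can be sketched in Python as follows, reading "normalized information gain" as the gain ratio (information gain divided by the entropy of the split itself). The functions and the four toy instances are assumptions made for this illustration, not the J48 source code.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of nominal values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, normalized by split entropy."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Toy instances, for illustration only.
labels = ["Mammal", "Mammal", "Aves", "Aves"]
origin = ["land", "land", "air", "air"]    # separates the classes perfectly
size = ["big", "small", "big", "small"]    # carries no class information
best = max({"origin": origin, "size": size}.items(),
           key=lambda kv: gain_ratio(kv[1], labels))[0]
```

The attribute with the highest ratio is chosen for the node, and the same computation is repeated on each resulting subset.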

Fig. 7 Simulation result for training set: J48

Fig. 8 Simulation result for test set: J48

Performance Evaluation

The 10-fold cross-validation technique is used to evaluate
the performance of the classification methods. The data set
is randomly subdivided into ten equal-sized partitions.
Nine of the partitions are used as the training
set and the remaining one is used as the test set.
Performance is compared using mean absolute error, root
mean squared error and kappa statistics [18]. A large test
set gives a good assessment of the classifier's performance,
while a small training set results in a poor classifier.
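
The partitioning used in 10-fold cross-validation can be sketched in Python as follows. The generator `k_fold_splits` and the fixed random seed are assumptions of this sketch; with 51 instances the folds cannot be exactly equal, so one fold holds 6 instances and the others hold 5.

```python
import random


def k_fold_splits(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # random subdivision
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(51))
```

Each instance appears in exactly one test fold, so every instance is used once for testing and nine times for training.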

Table 1 Classified instances for animal kingdom data set

Classifier     Correctly classified instances    Incorrectly classified instances
               Training set %    Test set %      Training set %    Test set %
Naive Bayes    58.0645 (18)      75 (15)         41.9355 (13)      25 (5)
SMO            70.9677 (22)      80 (16)         29.0323 (9)       20 (4)
IBK            70.9677 (22)      70 (14)         29.0323 (9)       30 (6)
J48            70.9677 (22)      80 (16)         29.0323 (9)       20 (4)

Kappa Statistics

Kappa is a measure of agreement normalized for chance
agreement:

$$K = \frac{P(A) - P(E)}{1 - P(E)}$$

where P(A) = percentage agreement and
P(E) = chance agreement.

If K = 1, agreement between the classifier and the
ground truth is perfect.

If K = 0, the agreement is no better than chance.
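
The kappa formula can be evaluated from a confusion matrix as in the following Python sketch. The function name and the two example matrices are hypothetical and are not taken from the experiments; P(E) is computed from the row and column totals, and the degenerate case P(E) = 1 is not handled.

```python
def cohen_kappa(confusion):
    """Kappa from a square confusion matrix (rows: actual, columns: predicted)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    # P(A): observed agreement, the fraction on the diagonal.
    p_a = sum(confusion[i][i] for i in range(k)) / total
    # P(E): agreement expected by chance from the marginal totals.
    p_e = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(k)
    ) / total ** 2
    return (p_a - p_e) / (1 - p_e)


# Hypothetical matrices, for illustration only.
perfect = [[10, 0], [0, 10]]
majority_only = [[16, 0], [4, 0]]  # always predicts the first class
kappa_perfect = cohen_kappa(perfect)
kappa_majority = cohen_kappa(majority_only)
```

The second matrix shows why kappa is reported alongside accuracy: a classifier that always predicts the majority class is 80% accurate here yet has a kappa of 0.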

Table 2 Kappa statistics for training and test set for animal kingdom

Classifier     Kappa statistics
               Training set    Test set
Naive Bayes    -0.0372         0.2
SMO            -0.0372         0.2727
IBK            0.2074          0.2258
J48            -0.0372






On the test set, the Naive Bayes, SMO and IBK classifiers
each produce a K value greater than 0, i.e., each is doing
better than chance [5], while the J48 classifier shows only
chance agreement. On the training set the IBK classifier
alone produces a K value greater than 0, while the other
classifiers give values less than 0. Therefore, by kappa,
IBK works best on the training set and SMO works best on
the test set.

Mean Absolute Error 

The mean absolute error (MAE) is a quantity used to
measure how close predictions are to the eventual outcomes.
The mean absolute error is given by

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |f_i - y_i| = \frac{1}{n}\sum_{i=1}^{n} |e_i|$$

The mean absolute error is an average of the
absolute errors $|e_i| = |f_i - y_i|$,

where $f_i$ = prediction

$y_i$ = true value.
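
As a worked illustration of the formula, the following Python sketch averages the absolute differences between predictions and true values. The function name and the sample numbers are hypothetical.

```python
def mean_absolute_error(predictions, targets):
    """Average of |f_i - y_i| over all n prediction/target pairs."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(f - y) for f, y in zip(predictions, targets)) / len(predictions)


# Hypothetical predictions and true values: absolute errors 0, 1 and 0.5.
mae = mean_absolute_error([1.0, 0.0, 0.5], [1.0, 1.0, 0.0])
```

Here the three absolute errors sum to 1.5, so the mean absolute error is 0.5.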

Root Mean Squared Error 

Root mean squared error is the square root of the mean
of the squared errors. Because the errors are squared before
they are averaged [18], RMSE gives a relatively high
weight to large errors.

The RMSE $E_i$ of an individual program $i$ is evaluated by
the equation:

$$E_i = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(P_{(ij)} - T_j\right)^2}$$

where $P_{(ij)}$ = the value predicted by the individual
program $i$ for fitness case $j$

$j$ = fitness case (out of $n$ fitness cases)

$T_j$ = the target value for fitness case $j$.
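
The following Python sketch illustrates how squaring gives more weight to large errors. The function name and the sample values are hypothetical: two errors of 1 and 7 have a mean absolute error of 4 but a root mean squared error of 5.

```python
import math


def root_mean_squared_error(predictions, targets):
    """Square root of the mean of the squared differences."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical predictions with errors 1 and 7 against targets of 0.
rmse = root_mean_squared_error([1.0, 7.0], [0.0, 0.0])
mean_abs = (abs(1.0 - 0.0) + abs(7.0 - 0.0)) / 2
```

The RMSE equals the mean absolute error only when all errors have the same magnitude; otherwise it is larger.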

Table 3 Error rate for classified instances

Classifier     Mean Absolute Error          Root Mean Squared Error
               Training set    Test set     Training set    Test set
Naive Bayes    0.1718          0.0881       0.3609          0.2714
SMO            0.2645          0.272        0.3529          0.3633
IBK            0.1451          0.1635       0.3024          0.3186
J48            0.1813          0.1269       0.3165          0.2774

Training on a data set generally minimizes the error rate
for the test set. For Naive Bayes and J48 the error rate for
the training set is higher than that of the test set. From
Table 3 and Fig. 9a, IBK has the lowest training-set error
rate compared to the other three algorithms. If two
algorithms have the same mean absolute error, then the root
mean squared error is taken into consideration for choosing
the best classification algorithm.

Fig. 9a Error rate for training set

For Naive Bayes and J48, the test set has a lower error rate
than the training set. It is clear from Fig. 9b that, for the
animal kingdom test set, the Naive Bayes classifier has the
lowest mean absolute error.




Fig. 9b Error rate for testing set (legend: Mean Absolute Error, Root Mean Squared Error)

Confusion Matrix Classification Accuracy 

Classification accuracy is the degree of correctness in
classification. The degree of correctness is evaluated
using various classifiers for individual instances in the
animal kingdom data set. The larger the training set, the
higher the classifier accuracy; the smaller the test set,
the lower the classifier accuracy. Similarly, a larger test
set provides a good assessment of classifier
accuracy [17]. In this paper the animal kingdom training set
is larger than the test set, which gives a higher accuracy
rate. The training set contains 60% of the whole data set and
the remaining 40% is used as the test set for
classification [21]. The Remove useless filter removes the
unwanted attributes, which reduces the time taken to build
the model.
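
The accuracy rate itself is the share of instances whose predicted class matches the actual class, as the following Python sketch shows. The function name and the label lists are hypothetical; the counts (22 of 31 and 16 of 20 correct) are chosen only to show how percentages such as 70.9677 and 80 arise from instance counts.

```python
def accuracy_percent(actual, predicted):
    """Percentage of instances whose predicted class equals the actual class."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)


# Hypothetical labels: 22 of 31 and 16 of 20 instances classified correctly.
train_acc = accuracy_percent(["Mammal"] * 22 + ["Aves"] * 9, ["Mammal"] * 31)
test_acc = accuracy_percent(["Mammal"] * 16 + ["Aves"] * 4, ["Mammal"] * 20)
```

The incorrectly classified percentage is simply 100 minus this value.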



Table 4a: Classification accuracy rate for confusion matrix:
Training set

Animal Kingdom    Naive Bayes    SMO        IBK        J48
Mammal            22.5806        70.9677    32.2581    29.0323
Aves              87.0968        25.8065    90.3226    87.0968
Amphibian         87.0968        90.3226    96.7742    25.8065
Reptile           90.3226        96.7742    80.6452    90.3226
Perciforms        93.5484        67.7419    67.7419    9.6774


Fig. 10a Accuracy Rate for Training set (x-axis: Animal Kingdom)

SMO and IBK have the same accuracy rate
compared to all other classifier algorithms [12]. This
shows that the two algorithms are effective in classifying
the training set. J48 provides the lowest result in
classification. The classification accuracy rate depends
on the number of animal kingdoms in the data set. For the
Mammal animal kingdom SMO has the highest accuracy
rate in the confusion matrix. The IBK classifier has the
highest accuracy rate for the Aves and Amphibian animal
kingdoms. SMO has the highest accuracy rate for the Reptile
animal kingdom. Naive Bayes shows the highest performance
for Perciforms [25].



Table 4b: Classification accuracy rate for confusion matrix:
Test set

Animal Kingdom    Naive Bayes    SMO    IBK    J48
Mammal            55             90     85     20
Aves              90             50     90     45
Amphibian         90             90     85     45
Reptile           90             75     5      65
Perciforms        90             90     85     65


Fig. 10b Accuracy Rate for Test set (x-axis: Animal Kingdom, with groups Mammal, Aves, Amphibian, Reptile, Perciforms; legend: NaiveBayes, SMO, IBK, J48)

Naive Bayes has the higher accuracy rate for the Aves,
Amphibian, Reptile and Perciforms animal kingdoms in
the above diagram. SMO has the higher accuracy rate for
Mammal, Amphibian and Perciforms. IBK has the higher
accuracy rate for Aves. J48 has considerable performance
for the Reptile and Perciforms animal kingdom data.

Result and Discussion 

The algorithm with the lowest mean absolute error
and the highest accuracy is chosen as the best algorithm. If
two algorithms show the same error rate and accuracy,
then both are considered to be effective in
classification. In this classification, each classifier shows
a different accuracy rate for different instances in the data
set. SMO and IBK have the highest classification
accuracy. Although both have the same training accuracy, IBK
has the lower mean absolute error compared to SMO. If two
algorithms have the same error rate and accuracy, then the
root mean squared error is taken into consideration.
SMO and IBK have the same correctly classified
instances for the training set, 70.9677%; for the testing
set SMO reaches 80% and IBK 70% (Table 1). Taking mean
absolute error and classification accuracy together, IBK is
considered the best classification algorithm. Across the
training and test sets, the J48 classifier is the least
performing algorithm for the animal kingdom data set.























Fig. 11 Knowledge flow environment diagram for animal
kingdom data set for naive bayes



The data flow for the animal kingdom data set is
verified using the knowledge flow experimenter. The above
figure shows the flow of the data set from the loader to
the output. The output obtained from the Explorer in
Weka is reproduced as such in the experimenter, and the
output is thereby verified.

Conclusion

This classification is discussed for evolving living things
in the environment. In this paper the performance of the
classifiers is discussed for the animal kingdom data set with
respect to accuracy rate, mean absolute error and
root mean squared error. Training set and test set
performance evaluation is also discussed. The best and
worst classification algorithms are identified for the
training and test sets. The best performing algorithms can be
used for evolutionary data sets. For the animal kingdom
data set IBK is the best performing and J48 is
the least performing algorithm. This type of
classification is also applicable to population evolution,
stock market change and vehicle data sets with
various error measures.